Skip to content

Enabling cooperative multi-gpu tests on multi-gpu nodes#27986

Merged
gshtras merged 2 commits intovllm-project:mainfrom
Alexei-V-Ivanov-AMD:MAIN_20251103
Nov 5, 2025
Merged

Enabling cooperative multi-gpu tests on multi-gpu nodes#27986
gshtras merged 2 commits intovllm-project:mainfrom
Alexei-V-Ivanov-AMD:MAIN_20251103

Conversation

@Alexei-V-Ivanov-AMD
Copy link
Copy Markdown
Collaborator

@Alexei-V-Ivanov-AMD Alexei-V-Ivanov-AMD commented Nov 3, 2025

Enabling cooperative multi-gpu tests on multi-gpu nodes.

Signed-off-by: Alexei V. Ivanov alexei.ivanov@amd.com

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enables cooperative multi-gpu tests by adding the container user to the render group for GPU device access and by removing the hardcoded HIP_VISIBLE_DEVICES=0 to allow tests to use multiple GPUs. The changes are correct in principle. However, I've pointed out a high-severity issue regarding the robustness of how the render group GID is obtained. My suggestions aim to make the script more robust and easier to debug by adding explicit checks and centralizing the logic, which also eliminates code duplication.

@@ -186,6 +186,7 @@ if [[ $commands == *"--shard-id="* ]]; then
--device /dev/kfd $BUILDKITE_AGENT_META_DATA_RENDER_DEVICES \
--network=host \
--shm-size=16gb \
--group-add $(getent group render | cut -d: -f3) \
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The command substitution $(getent group render | cut -d: -f3) can produce an empty string if the render group does not exist on the host, causing the docker run command to fail with a confusing error. Since this logic is duplicated in the else block, it's best to perform the check once before the if/else block, store the GID in a variable, and reuse it.

This improves robustness, provides clearer error messages, and avoids code duplication. For example:

# Before the if-else block
render_gid=$(getent group render | cut -d: -f3)
if [[ -z "$render_gid" ]]; then
  echo "Error: 'render' group not found. This is required for GPU access." >&2
  exit 1
fi

# ... then inside both docker run commands:
--group-add "$render_gid" \

@@ -217,8 +218,8 @@ else
--device /dev/kfd $BUILDKITE_AGENT_META_DATA_RENDER_DEVICES \
--network=host \
--shm-size=16gb \
--group-add $(getent group render | cut -d: -f3) \
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This duplicates the logic for getting the render group GID. As suggested in my other comment, this should be done once before the if/else block to improve robustness and avoid duplication.

@gshtras gshtras added rocm Related to AMD ROCm ready ONLY add when PR is ready to merge/full CI is needed labels Nov 3, 2025
@Alexei-V-Ivanov-AMD Alexei-V-Ivanov-AMD force-pushed the MAIN_20251103 branch 9 times, most recently from 1257540 to b570a7c Compare November 5, 2025 11:05
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>
@gshtras gshtras merged commit 80c9275 into vllm-project:main Nov 5, 2025
16 checks passed
ZhengHongming888 pushed a commit to ZhengHongming888/vllm that referenced this pull request Nov 8, 2025
…#27986)

Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
…#27986)

Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci/build ready ONLY add when PR is ready to merge/full CI is needed rocm Related to AMD ROCm

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants